Unsupervised Pattern Discovery in Biosequences Using Aligned Pattern Clustering
Abstract
Proteins, RNA and DNA are made up of sequences of amino acids or nucleotides, which interact with one another via binding. For example, (1) protein-DNA binding regulates gene transcription [1]; and (2) protein-protein binding plays important roles in cell cycle control and signal transduction [2]. The binding is maintained by either the direct participation or the assistance of conserved short segments of biosequences called functional elements. Because of their importance in preserving function, they are well conserved throughout evolution. Their recognition is therefore essential for an in-depth understanding of biological mechanisms [3] and for applications such as inhibitor design [4]. Although these functional elements could be discovered from the three-dimensional structures of the biosequences, the applicability of this approach is limited by its high experimental cost. With the advent of new sequencing technologies [5], it is preferable to discover functional elements directly from the abundant biosequence data. Many of these elements are short and of variable length, like Short Linear Motifs (SLiMs) [6], which play important roles in protein-protein interaction but are only 3 to 15 amino acids long. Such short elements are not captured well by the popular position weight matrices [7]. In this paper, we briefly review an unsupervised pattern discovery tool known as Aligned Pattern Clustering (or its software, WeMine™) [8-11], which was developed to facilitate the discovery and analysis of patterns in biosequences. Its applications include (1) identifying functional elements in protein sequences [8-11]; (2) revealing functioning subgroup characteristics of functional elements [12-14]; and (3) identifying co-occurring intra-protein [15,16], inter-protein [17] and protein-DNA functional elements [18,19].
Similar resources
Comparison Between Unsupervised and Supervise Fuzzy Clustering Method in Interactive Mode to Obtain the Best Result for Extract Subtle Patterns from Seismic Facies Maps
Pattern recognition on seismic data is a useful technique for generating seismic facies maps that capture changes in the geological depositional setting. Seismic facies analysis can be performed using the supervised and unsupervised pattern recognition methods. Each of these methods has its own advantages and disadvantages. In this paper, we compared and evaluated the capability of two unsuperv...
Knowledge Discovery in Biosequences Using Sort Regular Patterns
This paper considers knowledge discovery by sort regular patterns, which are strings over sort letters representing finite sets of basic letters. We devise a learning algorithm for the class based on the minimal multiple generalization technique, and evaluate the method by experiments on biosequences from the GenBank database. The experiments show that a relatively simple sort pattern can represent a...
Steel Consumption Forecasting Using Nonlinear Pattern Recognition Model Based on Self-Organizing Maps
Steel consumption is a critical factor affecting pricing decisions and a key element to achieve sustainable industrial development. Forecasting future trends of steel consumption based on analysis of nonlinear patterns using artificial intelligence (AI) techniques is the main purpose of this paper. Because there are several features affecting target variable which make the analysis of relations...
Reports in Informatics: Approaches to the Automatic Discovery of Patterns in Biosequences
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently...
Reports in Informatics: Relation Patterns and Their Automatic Discovery in Biosequences
We have extended the pattern language used in PROSITE to enable it to describe dependencies between amino acid residues. We have developed a fitness measure, based on the minimum description length principle, that evaluates the significance of such patterns in relation to a set of sequences, and an algorithm that automatically finds significant patterns in unaligned sequences. Computing experiments are reported show...